1 Introduction and objectives

In this hands-on session, we will show how principal component analysis (PCA) is used in common bioinformatics workflows. We will use a toy single-cell RNA-seq (scRNA-seq) dataset to illustrate it.

The main objectives are the following:

  • Understand the intuition behind PCA.
  • Learn to compute PCA in 3 ways in R: by eigendecomposition of the covariance matrix, by singular value decomposition and using the prcomp function.
  • Perform a case-study using scRNA-seq data.

2 Introduction to scRNA-seq

Single-cell RNA-seq allows the gene expression profiling of thousands of cells, one a time. It is often exemplified using the following analogy:

  • Bulk RNA-seq would be analogous to a fruit smoothie, in which all we get is an average taste out of several fruits. Analogously, in bulk RNA-seq all we get is an average transcriptome across a population of cells.
  • scRNA-seq, on the other hand, is analogous to a fruit salad, in which you can taste one fruit at a time. Here, we get a transcriptome for each single cell, so we can understand cell-to-cell variability in gene expression.

The scRNA-seq pipeline can be summarized in these 4 steps:

Image extracted from this paper.

  1. Sample preparation: cells are dissociated from their original tissue into single-cells using enzymatic or mechanical treatments. In addition, we can use FACS to enrich for a cell type of interest.
  2. Single-cell RNA-sequencing, which includes: single-cell capture (into droplets or wells), lysis, mRNA capture, reverse transcription, PCR amplification and Illumina sequencing.
  3. Data processing: map the fastq reads to obtain the gene expression matrix (cells by genes).
  4. Data analysis, which includes clustering, differential expression analysis, gene regulatory networks, etc.

3 Why use PCA?

scRNA-seq is a double-edged sword. On the one hand, scRNA-seq provides unprecedented discriminative power to chart all the cell types and states in a given tissue. This has catalyzed the creation of the Human Cell Atlas (HCA), which aims to classify all cells in the human body using scRNA-seq; and has been coined as ā€œthe periodic table of human cellsā€.

On the other hand, however, scRNA-seq has a high degree of technical variability, which can be introduced at multiple points of the aforementioned pipeline. Some examples include the following:

  • Dissociation artifacts: enzymatic treatment (ie collagenase) is frequently used to dissociate tissues into single-cells. This can induce cellular stress that can be confounded by the variable of interest. Described in detail in this paper.
  • Doublets: Two cells that are processed together in a reaction volume (e.g., a well or droplet) and receive the same single-cell barcode.
  • Dropouts: Transcripts that are not detected in the final dataset even though the gene is expressed in the cell, leading to false zero values in the expression matrix. (Definitions extracted from table 1 of this paper).

In this scenario, PCA is used for 3 main purposes:

  1. Reduce the noise of the dataset: by finding the orthogonal axis that maximize the variance, PCA eliminates noise introduced by dropouts, transcriptional bursts, etc.
  2. Reduce the redundancy: as we see below, there is a vast degree of redundancy across gene sets. Thus, principal components can be interpreted as a ā€œmetageneā€: a score that summarizes information from a correlated set of genes.
  3. Reduce computational complexity: since we reduce the dimensionality, operations downstream such as clustering will be executed faster.

4 Case-study

As an example, we will create an expression matrix with the following cells:

  • 4 T-cells (identified by high expression of CD3D and CD3E).
  • 3 monocytes (identified by high expression of LYZ and S100A8).
  • 3 naive B-cells (identified by high expression of CD79A, CD79B and BLNK).
  • 2 plasma cells (identified by B-cell and proliferation markers, such as TOP2A or MKI67).
  • 2 poor-quality cells (identified by high mitochondrial expression). If a cell has pores in the membrane due to cell lysis, the cytosolic mRNA will leak out of the cell; but if the diameter of mitochondria is higher than the pores, they will get trapped inside the cell.

4.1 Load packages

library(pheatmap)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## āœ“ ggplot2 3.3.2     āœ“ purrr   0.3.4
## āœ“ tibble  3.0.3     āœ“ dplyr   1.0.2
## āœ“ tidyr   1.1.2     āœ“ stringr 1.4.0
## āœ“ readr   1.3.1     āœ“ forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

4.2 Create and visualize toy dataset

# Create toy dataset
toy <- data.frame(
  CD3D = c(4, 5, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  CD3E = c(5, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  LYZ = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
  S100A8 = c(0, 0, 0, 0, 5, 7, 6, 0, 0, 0, 0, 0, 0, 0),
  CD79A = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 4, 3, 5, 0, 0),
  CD79B = c(0, 0, 0, 0, 0, 0, 0, 5, 5, 3, 4, 6, 0, 0),
  BLNK = c(0, 0, 0, 0, 0, 0, 0, 6, 4, 5, 5, 4, 0, 0),
  TOP2A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 9, 0, 0),
  MKI67 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 11, 10, 0, 0),
  MT_CO1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 23),
  MT_ND5 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 25) 
)
rownames(toy) <- paste("cell", as.character(1:nrow(toy)))
toy <- as.matrix(toy)


# Visualize
pheatmap(toy, cluster_cols = FALSE, cluster_rows = FALSE, angle_col = 45)

In the following image you can visualize the redundancy between CD79A and CD79B. Together, they form the heterodimer CD79; which means that for every molecule of CD79A we need one of CD79B. Under the hood they are the same variable, which should be captured by a single principal component. A good analogy would be to measure the height of a person in feet and in cm.

Images obtained from this link

4.3 PCA in R: 3 ways

Three ways to perform PCA in R:

  • Eigendecomposition of the covariance matrix.
  • Singular Value Decomposition.
  • prcomp function.

Regardless of the method, we always need to center the data:

toy_centered <- scale(toy, center = TRUE, scale = FALSE)

Now we can compute the 3 ways:

# Eigendecomposition of the covariance matrix
cov_matrix <- (1 / (nrow(toy_centered) - 1)) * (t(toy_centered) %*% toy_centered)
eigen_cov_matrix <- eigen(cov_matrix)
head(eigen_cov_matrix$values)
## [1] 133.1235625  36.7772802  14.4225675  10.9965640   0.4843188   0.2349899
eigen_cov_matrix$vectors[1:6, 1:6]
##             [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
## [1,] -0.03621728 -0.1281224 -0.2892754  0.37018269 -0.1942142 -0.4178419
## [2,] -0.03621728 -0.1281224 -0.2892754  0.37018269 -0.1942142 -0.4178419
## [3,] -0.04319892 -0.1949563  0.5855971 -0.07543384 -0.1254525 -0.2728953
## [4,] -0.04319892 -0.1949563  0.5855971 -0.07543384 -0.1254525 -0.2728953
## [5,] -0.06565032  0.2467930 -0.1395316 -0.42583898 -0.3020970 -0.3628173
## [6,] -0.06999161  0.2975760 -0.1023968 -0.35142006 -0.6180140  0.1555483
pc_mat_by_eigen <- toy %*% eigen_cov_matrix$vectors
pc_mat_by_eigen[1:6, 1:6]
##              [,1]       [,2]      [,3]       [,4]      [,5]      [,6]
## cell 1 -0.3259555 -1.1531013 -2.603479  3.3316442 -1.747928 -3.760577
## cell 2 -0.2897383 -1.0249790 -2.314203  2.9614615 -1.553714 -3.342735
## cell 3 -0.2897383 -1.0249790 -2.314203  2.9614615 -1.553714 -3.342735
## cell 4 -0.2535210 -0.8968566 -2.024928  2.5912788 -1.359499 -2.924893
## cell 5 -0.4319892 -1.9495631  5.855971 -0.7543384 -1.254525 -2.728953
## cell 6 -0.6047848 -2.7293883  8.198360 -1.0560738 -1.756335 -3.820534
# SVD
toy_to_svd <- (1 / sqrt(nrow(toy) - 1)) * toy_centered
toy_svd <- svd(toy_to_svd)
head(toy_svd$d ** 2)
## [1] 133.1235625  36.7772802  14.4225675  10.9965640   0.4843188   0.2349899
# PCA
pca_out <- prcomp(toy, center = TRUE)


# The singular values, PC scores and gene loadings are in the sdev, x and
# rotation, respectively
head(pca_out$sdev ** 2)
## [1] 133.1235625  36.7772802  14.4225675  10.9965640   0.4843188   0.2349899
pca_out$x
##               PC1        PC2          PC3          PC4         PC5         PC6
## cell 1  -3.951320 -3.9486085  3.661022353  3.654520583  0.25850086 -0.47150505
## cell 2  -3.915103 -3.8204861  3.371746930  3.284337897  0.06428666 -0.05366314
## cell 3  -3.915103 -3.8204861  3.371746930  3.284337897  0.06428666 -0.05366314
## cell 4  -3.878885 -3.6923637  3.082471507  2.914155211 -0.12992755  0.36417876
## cell 5  -4.057353 -4.7450703 -4.798427533 -0.431462021 -0.23490185  0.56011951
## cell 6  -4.230149 -5.5248955 -7.140815964 -0.733197393  0.26690820 -0.53146153
## cell 7  -4.143751 -5.1349829 -5.969621749 -0.582329707  0.01600317  0.01432899
## cell 8  -4.800316  1.8496472  3.205914736 -6.480809141  0.07373512 -0.80981026
## cell 9  -4.525318  0.7972224  2.660452378 -4.798614066  0.58270447  0.81573100
## cell 10 -4.457184  0.4814898  2.588858340 -4.511032499 -1.20990530  0.05468109
## cell 11 -6.171637 13.6435992 -2.534752330  3.489637251 -1.30875857 -0.10273060
## cell 12 -6.073610 12.1698727 -1.293445808  0.955715383  1.57894297  0.11540916
## cell 13 24.327990  0.5451695  0.001508073  0.008452255 -0.18771233  0.65946431
## cell 14 29.791738  1.1998922 -0.206657864 -0.053711651  0.16583749 -0.56107911
##                   PC7          PC8          PC9         PC10          PC11
## cell 1  -7.071068e-01  0.123135176 -0.175028076 -0.055146264 -8.459997e-17
## cell 2   1.414214e+00  0.003214116  0.002251261  0.003483378 -3.066446e-16
## cell 3   8.703355e-16  0.003214116  0.002251261  0.003483378  2.642233e-17
## cell 4  -7.071068e-01 -0.116706945  0.179530598  0.062113019  4.705115e-16
## cell 5  -2.002961e-14 -0.168732189  0.254045449  0.085826011 -3.066446e-16
## cell 6   2.232540e-14  0.146236682 -0.212530754 -0.068851789 -7.507338e-16
## cell 7   1.147891e-15 -0.011247753  0.020757348  0.008487111 -7.507338e-16
## cell 8   3.109616e-14  0.307995407 -0.051643310  0.116372232  1.104971e-16
## cell 9  -3.118735e-14 -0.461037674 -0.470049370 -0.030457398  8.107271e-16
## cell 10 -1.794200e-15  0.018243159  0.411680421 -0.120109433  2.426507e-16
## cell 11  4.950405e-15 -0.120543044 -0.283243462  0.023737601  3.613459e-16
## cell 12 -4.403224e-15  0.149102392  0.345292787 -0.028677522  1.860972e-16
## cell 13 -2.777342e-14  0.769768895 -0.136118692 -0.001458603  5.634938e-17
## cell 14  2.263071e-14 -0.642642339  0.112804538  0.001198280  3.788446e-16
pca_out$rotation
##                PC1         PC2         PC3          PC4          PC5
## CD3D   -0.03621728 -0.12812237  0.28927542  0.370182686  0.194214205
## CD3E   -0.03621728 -0.12812237  0.28927542  0.370182686  0.194214205
## LYZ    -0.04319892 -0.19495631 -0.58559711 -0.075433843  0.125452513
## S100A8 -0.04319892 -0.19495631 -0.58559711 -0.075433843  0.125452513
## CD79A  -0.06565032  0.24679303  0.13953157 -0.425838976  0.302097034
## CD79B  -0.06999161  0.29757597  0.10239682 -0.351420065  0.618014029
## BLNK   -0.07184864  0.27941939  0.13319961 -0.415258562 -0.556581707
## TOP2A  -0.07437979  0.57153188 -0.22431747  0.353096905 -0.237385930
## MKI67  -0.07432314  0.56844945 -0.21769693  0.335371635  0.221260081
## MT_CO1  0.64897128  0.07824766 -0.02520615 -0.007577383  0.069159859
## MT_ND5  0.73963051  0.08782812 -0.02737840 -0.008092330  0.002583509
##                 PC6           PC7         PC8        PC9        PC10
## CD3D   -0.417841906  7.071068e-01  0.11992106 -0.1772793 -0.05862964
## CD3E   -0.417841906 -7.071068e-01  0.11992106 -0.1772793 -0.05862964
## LYZ    -0.272895260  1.082467e-14  0.07874222 -0.1166441 -0.03866945
## S100A8 -0.272895260  1.035283e-14  0.07874222 -0.1166441 -0.03866945
## CD79A  -0.362817278  1.368350e-14  0.32241549  0.4184122  0.48172172
## CD79B   0.155548283 -5.967449e-15 -0.20858989 -0.5454695 -0.15932743
## BLNK   -0.449953352  1.745826e-14  0.06210105 -0.2092092 -0.40830690
## TOP2A  -0.063532910  2.581269e-15 -0.07320320 -0.3655626  0.54170097
## MKI67   0.007874334 -4.163336e-16  0.11551433  0.4232463 -0.51959231
## MT_CO1 -0.338627354  1.454392e-14 -0.64733602  0.1811808  0.02643551
## MT_ND5  0.157531117 -7.438494e-15  0.60808962 -0.2189936 -0.04317356
##                 PC11
## CD3D   -4.509519e-16
## CD3E   -1.153023e-16
## LYZ     7.071068e-01
## S100A8 -7.071068e-01
## CD79A   2.766429e-16
## CD79B  -2.934072e-17
## BLNK   -6.267579e-16
## TOP2A   6.589544e-16
## MKI67  -6.802521e-16
## MT_CO1  2.499914e-16
## MT_ND5 -3.091539e-16
# PCA centered and scaled
pca_out_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

Importantly, we obtain the following:

  • Gene loadings: pca_out$rotation
  • PC scores: pca_out$x
  • Variance associated to each PC: pca_out$sdev ** 2

In addition, we can get the proportion of variance explained (PVE) and the cumulative proportion with the summary function (more info in this book):

summary(pca_out_scaled)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.1414 1.6160 1.5448 1.1065 0.28172 0.25075 0.19522
## Proportion of Variance 0.4169 0.2374 0.2170 0.1113 0.00722 0.00572 0.00346
## Cumulative Proportion  0.4169 0.6543 0.8712 0.9826 0.98977 0.99548 0.99895
##                            PC8     PC9    PC10      PC11
## Standard deviation     0.09752 0.04135 0.01862 1.383e-16
## Proportion of Variance 0.00086 0.00016 0.00003 0.000e+00
## Cumulative Proportion  0.99981 0.99997 1.00000 1.000e+00

4.4 Infer dimensionality of the dataset

To reduce the dimensionality, we will choose the number of PC that explains most of the variance in the dataset. To this end, we use a so-called elbow plot:

# Calculate percentage of variance explained (PVE):
pve <- pca_out_scaled$sdev ** 2 / sum(pca_out$sdev ** 2) * 100
pve <- round(pve, 2)
pve_df <- data.frame(principal_component = 1:length(pve), pve = pve)
pve_gg <- pve_df %>%
  ggplot(aes(principal_component, pve)) +
    geom_point() +
    geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
    scale_x_continuous(breaks = 1:length(pve)) +
    labs(x = "Principal Component", y = "Percentage of Variance Explained (%)") +
    theme_bw()
pve_gg

Some people prefer to plot the cumulative percentage of explained variance:

summ_pca <- summary(pca_out_scaled)
cum_pct <- summ_pca$importance["Cumulative Proportion", ] * 100
cum_pct_df <- data.frame(principal_component = 1:length(pve), cum_pct = cum_pct)
cum_pct_gg <- cum_pct_df %>%
  ggplot(aes(principal_component, cum_pct)) +
    geom_point() +
    geom_vline(xintercept = 4.5, linetype = "dashed", color = "darkblue") +
    scale_x_continuous(breaks = 1:length(cum_pct)) +
    scale_y_continuous(limits = c(0, 100)) +
    labs(x = "Principal Component", y = "Cumulative Percentage of Variance Explained (%)") +
    theme_bw()
cum_pct_gg

Thus, we conclude that the first 4 PC are significant. Let us express each cell as a vector of the first 4 PC:

toy_reduced <- pca_out_scaled$x[, c("PC1", "PC2", "PC3", "PC4")]
gene_loads_reduced <- pca_out_scaled$rotation[, c("PC1", "PC2", "PC3", "PC4")]
pheatmap(toy_reduced, cluster_rows = FALSE, cluster_cols = FALSE, angle_col = 45)

4.5 Visualize gene loadings

Now we have reduced the dimensionality of the dataset. To interpret each PC, let us inspect the gene loadings:

significant_pcs <- c("PC1", "PC2", "PC3", "PC4")
loadings_gg <- map(significant_pcs, function(x) {
  loadings <- gene_loads_reduced[, x]
  df <- data.frame(gene = names(loadings), score = loadings)
  p <- df %>%
    ggplot(aes(fct_reorder(gene, score), score)) +
      geom_point() +
      labs(x = "", y = x) +
      theme_bw() +
      coord_flip()
  return(p)
})
loadings_gg
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Interpretation of the PC:

  • PC1: B cell identity.
  • PC2: differences between monocytes and T cells.
  • PC3: technical variation.
  • PC4: cell cycle.

4.6 Cluster cells

Clustering means classifying observations into groups that minimize and maximize intra-group and inter-group distances, respectively.

4.6.1 Calculate all pairwise Euclidean distances

To that end we need a concept of distance, which basically measures how similar or different two vectors are. In our case we will use Euclidean distance, but be aware that there are a plethora of different options that can yield different clustering results.

Let us start by computing all pairwise distances:

dist_mat <- dist(toy_reduced, method = "euclidean")
pheatmap(dist_mat, cluster_rows = FALSE, cluster_cols = FALSE)

4.7 Perform hierarchical clustering

hclust_average <- hclust(dist_mat, method = "average")
plot(
  hclust_average,
  labels = rownames(toy),
  main = "Average Linkage",
  xlab = "",
  sub = "",
  ylab = ""
)

4.7.1 Cut the dendrogram and visualize clusters

plot(
  hclust_average,
  labels = rownames(toy),
  main = "Average Linkage",
  xlab = "",
  sub = "",
  ylab = ""
)
abline(h = 2.5, col = "red")

hc_clusters <- cutree(hclust_average, h = 2.5)
hc_clusters
##  cell 1  cell 2  cell 3  cell 4  cell 5  cell 6  cell 7  cell 8  cell 9 cell 10 
##       1       1       1       1       2       2       2       3       3       3 
## cell 11 cell 12 cell 13 cell 14 
##       4       4       5       5
table(hc_clusters)
## hc_clusters
## 1 2 3 4 5 
## 4 3 3 2 2
annotation_rows <- data.frame(cluster = as.character(hc_clusters))
rownames(annotation_rows) <- names(hc_clusters)
pheatmap(
  toy_centered,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  angle_col = 315,
  annotation_row = annotation_rows
)

4.7.2 Annotation

Cluster Cell type
1 T-cells
2 Monocytes
3 Naive B-cells
4 Plasma Cells
5 poor-quality cells
annotated_clusters <- c("T-cells", "Monocytes", "Naive B-cells", "Plasma Cells", "poor-quality cells")
names(annotated_clusters) <- as.character(1:5)
annotation_rows$annotation <- annotated_clusters[annotation_rows$cluster]
pheatmap(
  toy_centered,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  angle_col = 315,
  annotation_row = annotation_rows
)

5 Session Information

sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0    stringr_1.4.0    dplyr_1.0.2      purrr_0.3.4     
##  [5] readr_1.3.1      tidyr_1.1.2      tibble_3.0.3     ggplot2_3.3.2   
##  [9] tidyverse_1.3.0  pheatmap_1.0.12  BiocStyle_2.16.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0    xfun_0.18           haven_2.3.1        
##  [4] colorspace_1.4-1    vctrs_0.3.4         generics_0.0.2     
##  [7] htmltools_0.5.0     yaml_2.2.1          blob_1.2.1         
## [10] rlang_0.4.7         pillar_1.4.6        withr_2.3.0        
## [13] glue_1.4.2          DBI_1.1.0           RColorBrewer_1.1-2 
## [16] dbplyr_1.4.4        modelr_0.1.8        readxl_1.3.1       
## [19] lifecycle_0.2.0     munsell_0.5.0       gtable_0.3.0       
## [22] cellranger_1.1.0    rvest_0.3.6         evaluate_0.14      
## [25] labeling_0.3        knitr_1.30          fansi_0.4.1        
## [28] broom_0.7.1         Rcpp_1.0.5          scales_1.1.1       
## [31] backports_1.1.10    BiocManager_1.30.10 magick_2.4.0       
## [34] jsonlite_1.7.1      farver_2.0.3        fs_1.5.0           
## [37] hms_0.5.3           digest_0.6.25       stringi_1.5.3      
## [40] bookdown_0.20       grid_4.0.1          cli_2.0.2          
## [43] tools_4.0.1         magrittr_1.5        crayon_1.3.4       
## [46] pkgconfig_2.0.3     ellipsis_0.3.1      xml2_1.3.2         
## [49] reprex_0.3.0        lubridate_1.7.9     rstudioapi_0.11    
## [52] assertthat_0.2.1    rmarkdown_2.4       httr_1.4.2         
## [55] R6_2.4.1            compiler_4.0.1